Using pmap on Amazon EC2?

Hi everyone, I have some questions about using pmap. I’m trying to use it to run some embarrassingly parallel code on Amazon EC2, but each worker seems to use more and more memory until one of them terminates, at which point the others stop as well. I’d like to understand why this happens and what I can do to manage the execution better.

Here is an example. I start a t2.large instance (AMI: JuliaPro 0.6.2.1_mkl RHEL 7.4 (ami-01e5a57b)). I then run Julia:

$ JuliaPro-0.6.2.1/julia 

Running top at this stage gives me the following:

10359 ec2-user  20   0  113128   1400   1204 S   0.0  0.0   0:00.00 julia 
10365 ec2-user  20   0  479876 146188  48328 S   0.0  1.8   0:00.40 julia  

I create the workers and dummy functions for pmap:

julia> addprocs(4)
4-element Array{Int64,1}:
 2
 3
 4
 5

julia> @everywhere f = function(x)
               A = rand(x*10000,x*10000)
               return 0 
       end

julia> err_fn = function(e)
               @show e
               @show myid()
       end
(::#7) (generic function with 1 method)

julia> @show nprocs()
nprocs() = 5
5

Running top now shows the Julia processes of the workers:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND        
10359 ec2-user  20   0  113128   1400   1204 S   0.0  0.0   0:00.00 julia
10365 ec2-user  20   0  648556 172100  49648 S   0.0  2.1   0:01.72 julia                                      
10368 ec2-user  20   0  479380 145884  48312 S   0.0  1.8   0:00.44 julia
10369 ec2-user  20   0  578408 147696  48540 S   0.0  1.8   0:00.59 julia
10371 ec2-user  20   0  578312 150800  48512 S   0.0  1.9   0:00.59 julia
10373 ec2-user  20   0  578864 149972  48512 S   0.0  1.9   0:00.57 julia 

When I now run pmap, some of the workers report OutOfMemoryError(), and one of the worker processes eventually exits:

julia> out = pmap(f, 1:15, on_error = err_fn)
	From worker 5:	e = OutOfMemoryError()
	From worker 5:	myid() = 5
	From worker 5:	e = OutOfMemoryError()
	From worker 5:	myid() = 5
	From worker 5:	e = OutOfMemoryError()
	From worker 5:	myid() = 5
	From worker 5:	e = OutOfMemoryError()
	From worker 5:	myid() = 5
	From worker 5:	e = OutOfMemoryError()
	From worker 5:	myid() = 5
	From worker 2:	e = OutOfMemoryError()
	From worker 2:	myid() = 2
	From worker 5:	e = OutOfMemoryError()
	From worker 5:	myid() = 5
	From worker 5:	e = OutOfMemoryError()
	From worker 5:	myid() = 5
	From worker 2:	e = OutOfMemoryError()
	From worker 2:	myid() = 2
	From worker 5:	e = OutOfMemoryError()
	From worker 5:	myid() = 5
	From worker 2:	e = OutOfMemoryError()
	From worker 2:	myid() = 2
	From worker 5:	e = OutOfMemoryError()
	From worker 5:	myid() = 5






Worker 4 terminated.
ERROR: ProcessExitedException()
ERROR (unhandled task failure): EOFError: read end of file

Stacktrace:
 [1] #571 at ./asyncmap.jl:178 [inlined]
 [2] foreach(::Base.##571#573, ::Array{Any,1}) at ./abstractarray.jl:1733
 [3] maptwice(::Function, ::Channel{Any}, ::Array{Any,1}, ::UnitRange{Int64}, ::Vararg{UnitRange{Int64},N} where N) at ./asyncmap.jl:178
 [4] wrap_n_exec_twice(::Channel{Any}, ::Array{Any,1}, ::Base.Distributed.##204#207{WorkerPool}, ::Function, ::UnitRange{Int64}, ::Vararg{UnitRange{Int64},N} where N) at ./asyncmap.jl:154
 [5] #async_usemap#556(::Function, ::Void, ::Function, ::Base.Distributed.##188#190, ::UnitRange{Int64}, ::Vararg{UnitRange{Int64},N} where N) at ./asyncmap.jl:103
 [6] (::Base.#kw##async_usemap)(::Array{Any,1}, ::Base.#async_usemap, ::Function, ::UnitRange{Int64}, ::Vararg{UnitRange{Int64},N} where N) at ./<missing>:0
 [7] (::Base.#kw##asyncmap)(::Array{Any,1}, ::Base.#asyncmap, ::Function, ::UnitRange{Int64}) at ./<missing>:0
 [8] #pmap#203(::Bool, ::Int64, ::Function, ::Array{Any,1}, ::Void, ::Function, ::WorkerPool, ::Function, ::UnitRange{Int64}) at ./distributed/pmap.jl:126
 [9] (::Base.Distributed.#kw##pmap)(::Array{Any,1}, ::Base.Distributed.#pmap, ::WorkerPool, ::Function, ::UnitRange{Int64}) at ./<missing>:0
 [10] #pmap#213(::Array{Any,1}, ::Function, ::Function, ::UnitRange{Int64}) at ./distributed/pmap.jl:156
 [11] (::Base.Distributed.#kw##pmap)(::Array{Any,1}, ::Base.Distributed.#pmap, ::Function, ::UnitRange{Int64}) at ./<missing>:0

julia> @show nprocs()
nprocs() = 4
4

Checking top again (the bolded worker process is holding about 3 GB of resident memory):

10359 ec2-user  20   0  113128    196      0 S   0.0  0.0   0:00.00 julia
10365 ec2-user  20   0  713632 170536  15616 S   0.0  2.1   0:03.91 julia
10368 ec2-user  20   0  612844 102656    760 S   0.0  1.3   0:01.33 julia
**10369 ec2-user  20   0 3770456 3.082g    828 S   0.0 40.4   0:02.05 julia**
10373 ec2-user  20   0  711352 106108    764 S   0.0  1.3   0:01.38 julia     

When I run @everywhere gc(), the memory usage of that worker drops back to roughly its earlier level:

10359 ec2-user  20   0  113128    196      0 S   0.0  0.0   0:00.00 julia
10365 ec2-user  20   0  713764 154368  15864 S   0.0  1.9   0:04.09 julia
10368 ec2-user  20   0  612844 113224   6304 S   0.0  1.4   0:01.44 julia
10369 ec2-user  20   0  645452 112880   6584 S   0.0  1.4   0:02.12 julia
10373 ec2-user  20   0  711352 111652   6308 S   0.0  1.4   0:01.49 julia
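As a rough sanity check on the numbers above, the sketch below estimates how much memory each call to f asks for. It assumes that rand(n, n) returns a dense Float64 matrix (8 bytes per element) and infers total memory from the top line that reports 3.082g as 40.4% of RAM; the names matrix_bytes and total_mib are made up for this illustration.

```julia
# rand(n, n) returns a dense n-by-n matrix of Float64, 8 bytes per element,
# so f(x) asks for 8 * (x * 10000)^2 bytes in a single allocation.
matrix_bytes(x) = sizeof(Float64) * (x * 10000)^2

# Total memory implied by the top output above: 3.082 GiB is 40.4% of RAM.
total_mib = 3.082 * 1024 / 0.404

for x in 1:15
    mib = div(matrix_bytes(x), 2^20)   # whole MiB, rounded down
    verdict = mib < total_mib ? "fits on its own" : "exceeds total memory"
    println("x = ", x, ": ", mib, " MiB (", verdict, ")")
end
```

By this estimate only x = 1, 2, and 3 could fit even with nothing else running, so most of 1:15 would be expected to fail regardless of how pmap schedules the work. The x = 2 matrix is about 3051 MiB, which is consistent with the roughly 3 GB held by the bolded worker.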

My questions are:

  1. Is there a way to run pmap you would recommend that would be robust to these types of failures?
  2. This is not shown in the example, but when I run my own code, the memory usage of each worker keeps increasing, even though I try to free memory by setting the array to 0 and calling @everywhere gc(). I do not see this when running the code locally. Is this normal? Is there a way to completely clear/reset a worker each time it executes the function given to pmap?
  3. (More generally) Why are there two Julia processes when I start Julia on EC2?
  4. (Related to 2) Does Julia tend to use more memory on EC2 than on “normal” machines?
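For question 1, one pattern that may be worth trying is to catch the exception inside the worker and return it as an ordinary value, so that pmap itself never sees a failure, and to force a garbage collection before the worker takes its next task. The sketch below is an illustration under assumptions, not a confirmed fix: guarded, collect_garbage, and small_task are hypothetical names, small_task stands in for f so the example runs anywhere, and the version checks exist only because gc() became GC.gc() and pmap moved to the Distributed standard library in Julia 0.7. It cannot help if the operating system kills a worker outright, since no Julia code runs in that case.

```julia
@static if VERSION >= v"0.7.0-"
    using Distributed   # pmap and @everywhere live here on Julia 0.7+
end

@everywhere begin
    # gc() was renamed to GC.gc() in Julia 0.7; call whichever exists.
    collect_garbage() = VERSION >= v"0.7.0-" ? GC.gc() : gc()

    # Hypothetical wrapper (not part of pmap): run g(x) on the worker,
    # turn any exception into a return value, and force a collection
    # before the worker picks up its next task.
    function guarded(g, x)
        try
            return (x, g(x), nothing)   # (input, result, no error)
        catch err
            return (x, nothing, err)    # (input, no result, the error)
        finally
            collect_garbage()
        end
    end

    # Stand-in for f: fails on one input instead of exhausting memory.
    small_task(x) = x == 3 ? error("simulated failure") : x^2
end

# Uses the workers added earlier with addprocs; with no workers it
# simply runs on the master process.
results = pmap(x -> guarded(small_task, x), 1:5)

# Inputs whose task threw, collected for inspection or a later retry.
failed = [r[1] for r in results if r[3] !== nothing]
println("failed inputs: ", failed)
```

With the original function this would be called as pmap(x -> guarded(f, x), 1:15); entries whose third field is not nothing are the inputs that failed and could be retried with fewer workers.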

I apologize if these are stupid questions, and thank you in advance for any suggestions or information you can provide.